Towards Unsupervised Word Error Correction in Textual Big Data

Authors

  • João Paulo Carvalho
  • Sérgio Curto
Abstract

Large unedited technical text databases may contain information that Natural Language Processing (NLP) tools cannot properly extract because of the many word errors present. A good example is the MIMIC II database, where medical text reports are a direct representation of experts’ views on real-time observable data. Such reports contain valuable information that could improve predictive medical decision-making models based on physiological data, but they have not been used for that purpose so far. In this paper we propose a fuzzy-based semi-automatic method that specifically addresses the large number of word errors in such databases, allowing the direct application of NLP techniques, such as Bag of Words, to the textual data.
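The abstract does not describe the fuzzy method itself, so the following is only a minimal sketch of the general idea: map misspelled tokens to their closest entry in a reference vocabulary, then build a Bag of Words from the corrected tokens. It uses the similarity ratio from Python's standard `difflib` module as a stand-in for the paper's fuzzy measure. The vocabulary, the `correct_word` and `bag_of_words` helpers, the 0.8 cutoff, and the sample report are all hypothetical.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical reference vocabulary; a real one would come from a domain lexicon.
VOCABULARY = ["patient", "blood", "pressure", "remains", "stable"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the most similar vocabulary entry, or the word unchanged
    if no entry reaches the similarity cutoff (an assumed threshold)."""
    if word in vocabulary:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def bag_of_words(text, vocabulary=VOCABULARY):
    """Lowercase, split on whitespace, correct each token, and count."""
    tokens = text.lower().split()
    return Counter(correct_word(token, vocabulary) for token in tokens)


# Invented example text with three typical typing errors.
report = "Pateint blod presure remains stable"
counts = bag_of_words(report)
print(counts)
```

Without the correction step, "pateint", "blod", and "presure" would each be counted as separate, unknown terms, which is the kind of fragmentation the abstract says prevents direct use of Bag of Words on such reports.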

Similar resources

Context-based Speech Recognition Error Detection and Correction

In this paper we present preliminary results of a novel unsupervised approach for high-precision detection and correction of errors in the output of automatic speech recognition systems. We model the likely contexts of all words in an ASR system vocabulary by performing a lexical co-occurrence analysis using a large corpus of output from the speech system. We then identify regions in the data th...

Design and implementation of Persian spelling detection and correction system based on Semantic

The Persian language has special features (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, the design and implementation of NLP tools for Persian are more challenging than for other languages (e.g. English or German). Spelling tools are widely used for editing user texts such as emails and text in editors. Also, developing Persian tools will provide Persian progr...

Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information

In computing, spell checking is the process of detecting and sometimes providing spelling suggestions for incorrectly spelled words in a text. Basically, a spell checker is a computer program that uses a dictionary of words to perform spell checking. The bigger the dictionary, the higher the error detection rate. Because spell checkers are based on regular dictionaries, they suffer ...

Unsupervised learning from users' error correction in speech dictation

We propose an approach to adapting automatic speech recognition systems used in dictation systems through unsupervised learning from users’ error correction. Three steps are involved in the adaptation: 1) infer whether the user is correcting a speech recognition error or simply editing the text, 2) infer what the most possible cause of the error is, and 3) adapt the system accordingly. To adapt...

WordNet2Vec: Corpora Agnostic Word Vectorization Method

The complex nature of big data resources demands new methods for structuring, especially for textual content. WordNet is a good knowledge source for comprehensive abstraction of natural language, as good implementations of it exist for many languages. Since WordNet embeds natural language in the form of a complex network, a transformation mechanism WordNet2Vec is proposed in the paper. It creates vec...

Publication date: 2014